Nathaniel Dake Blog

4. Non Parametric Hypothesis Testing: KS-Score

Let's dig into non parametric testing, starting with why it is even needed in the first place.

In [1]:
import numpy as np
from scipy.stats import bernoulli, binom, norm, ks_2samp
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rc, animation
from IPython.core.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter

from _plotly_future_ import v4_subplots
from plotly import tools
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.figure_factory as ff

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

Overall post structure

  • When and why doesn't CLT (and normal dist) apply? (https://towardsdatascience.com/kolmogorov-smirnov-test-84c92fb4158d)
  • It is incredibly important to know when theorems and equations DO NOT HOLD
  • Taleb -> the bell curve, the great intellectual fraud, the spectator and the prostitute (chapter 3)
  • Model thinker -> power law chapter and normal dist chapter
  • When there is a lack of independence, or when probability (even if uniform) corresponds to VERY different outputs
  • Expected value based on law of large numbers
  • Arrow problem
  • arrow problem, theoretical expectation (will need to take a limit!)

What can we do about it?

  • KS Score
  • What is it? (explanation done, still need lots of edits, particularly the inclusion of links, theory)
  • Create an experiment runner (look at different distributions in order to see what their KS score would be)

Appendix

When will CLT apply and when will it not?

The Central Limit Theorem: tells us that as the sample size tends to infinity, the distribution of sample means approaches the normal distribution. This is a statement about the SHAPE of the distribution. A normal distribution is bell shaped so the shape of the distribution of sample means begins to look bell shaped as the sample size increases. https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading6b.pdf. TODO: show that the original variable is NOT normally distributed (it is a yes/no success outcome-i.e. the basketball follows a bernoulli distribution). https://www.youtube.com/watch?v=Pujol1yC1_A.

The Law of Large Numbers: tells us where the center (maximum point) of the bell is located. Again, as the sample size approaches infinity the center of the distribution of the sample means becomes very close to the population mean. Very informally, it states that when you do an experiment a repeated number of times, add up the outcomes, and take the average, it looks like the average of the distribution (link to: https://www.nathanieldake.com/Mathematics/04-Statistics-01-Introduction.html). Again, if you do a large number of experiments, the average result is the average of your distribution. As number of trials increase, we approach the average.

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer to the expected value as more trials are performed.

Example where it will apply

Let's look at an example where it will apply (show both trials vs. value, as well as histogram of a few different numbers of trials). Consider the following toy example: There is a basketball player who has a %72.3 free throw success rate. If he takes 50 free throws, how many do we expect him to make? The guess that maximizes the likelihood would be 50 * .723 = 36.15. Now, if we consider a single trial to be our player taking 50 free throws, what happens to the average of our outcomes as our number of trials increases?

TODO: Explain that basketball shots are not normally distributed, in this case they are drawn from a bernoulli distribution.

In [2]:
def single_trial(shots_per_trial=50):
    """Returns the number of successful shots made in `shots_per_trial` attempts."""
    
    success_rate = 0.723
    outcome_prob = np.random.uniform(size=shots_per_trial)
    outcome_value = outcome_prob < success_rate
    
    return outcome_value.sum()
  
    
def run_trials(num_trials=10_000, shots_per_trial=50):
    """Performs `num_trials` of single_trial() and returns an array of the outcomes."""
    
    trial_outcomes = []
    
    for i in range(num_trials):
        baskets_made_in_single_trial = single_trial(shots_per_trial=shots_per_trial)
        trial_outcomes.append(baskets_made_in_single_trial)
        
    return np.asarray(trial_outcomes)


def calculate_cumulative_means(trial_outcomes):
    """Calculates cumulative means based on list of trial outcomes."""
    
    cumulative_means = []
    
    for i in range(1, len(trial_outcomes)):
        subset_mean = trial_outcomes[:i].mean()
        cumulative_means.append(subset_mean)
        
    return cumulative_means
        

def run_experiment(num_trials=10_000, shots_per_trial=50):
    """Orchestrates experiment by running trials and calculating cumulative means."""
    
    trial_outcomes = run_trials(num_trials=num_trials, shots_per_trial=shots_per_trial)
    trials_index = np.arange(0, trial_outcomes.shape[0], 1)
    
    cumulative_means = calculate_cumulative_means(trial_outcomes)
    
    return trial_outcomes, trials_index, cumulative_means   
In [3]:
trial_outcomes, trials_index, cumulative_means = run_experiment(num_trials=10_000, shots_per_trial=50)

First let's look at a plot that shows the cumulative average-notice that it approaches the average (based on MLE-link to derivation of MLE for bernoulli experiment, or derive it). This is the law of large numbers:

In [6]:
trace1 = go.Scatter(
    x=trials_index,
    y=cumulative_means,
    marker = dict(color='crimson'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Sample Cumulative Average"
)

trace2 = go.Scatter(
    x=trials_index,
    y=np.ones_like(trials_index)*36.15,
    marker = dict(color='blue'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Theoretical Expected Value"
)

data = [trace1, trace2]

layout = go.Layout(
    width=800,
    height=400,
    title="Cumulative average # shots made in trials vs. number of trials",
    xaxis=dict(title="Trials"),
    yaxis=dict(title="Cumulative average # shots made"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

fig.update_layout(layout)

# plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))

Now, this also relates to the central limit theorem, which states that as our sample size increases our distribution will approach the normal distribution:

In [91]:
trial_outcomes, trials_index, cumulative_means = run_experiment(num_trials=100_000, shots_per_trial=50)
In [92]:
trace1 = go.Histogram(
    x=trial_outcomes[:100],
    nbinsx=30,
    name="Sample 1",
    histnorm='probability density',
    marker_color='crimson',
    opacity=0.7,
    showlegend=False
)

trace2 = go.Histogram(
    x=trial_outcomes[:1_000],
    nbinsx=30,
    name="Sample 2",
    histnorm='probability density',
    marker_color='crimson',
    opacity=0.7,
        showlegend=False
)

trace3 = go.Histogram(
    x=trial_outcomes[:10_000],
    nbinsx=30,
    name="Sample 3",
    histnorm='probability density',
    marker_color='crimson',
    opacity=0.7,
    showlegend=False
)

trace4 = go.Histogram(
    x=trial_outcomes[:100_000],
    nbinsx=30,
    name="Sample 4",
    histnorm='probability density',
    marker_color='crimson',
    opacity=0.7,
    showlegend=False
)

fig = plotly.subplots.make_subplots(
    rows=1,
    cols=4,
    print_grid=False,
    subplot_titles=("100 Trials", "1,000 Trials", "10,000 Trials", "100,000 Trials")
)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig.append_trace(trace4, 1, 4)

fig.update_traces(marker=dict(line=dict(width=1, color='black')))
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

layout = go.Layout(
    barmode='overlay',
    width=950,
    height=300,
    xaxis1=dict(title="X"),
    yaxis=dict(title="Probabilty Density"),
    xaxis2=dict(title='X'),
    xaxis3=dict(title="X"),
    xaxis4=dict(title="X"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig.update_layout(layout)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))

We can clearly see that as the number of trials increases, our distribution starts to look more and more like the traditional bell curve. Remember, the law of large numbers stated that as the number of trials increased, the mean of the outcomes should approach the expected value of the distribution (which was demonstrated in earlier plot).

We can give a bit more detail in showing as the number of trials increase we do indeed approach the normal distribution. Below, we can see an overlay of the normal distribution against the our histograms:

In [93]:
def overlay_normal_dist(x_data, hist_trace):
    mu, std = norm.fit(x_data)
    x_min = hist_trace.x.min()
    x_max = hist_trace.x.max()
    x_axis = np.linspace(x_min, x_max, 100)
    norm_pdf = norm.pdf(x_axis, mu, std)
    
    return x_axis, norm_pdf
In [97]:
def experiment_sample_size(data={}, sample_size=10):
    if len(data) > 0:
        trial_outcomes = data['trial_outcomes']
        trials_index = data['trials_index']
        cumulative_means = data['cumulative_means']
    else:
        trial_outcomes, trials_index, cumulative_means = run_experiment(num_trials=100_000, shots_per_trial=sample_size)
    
    trace1a = go.Histogram(
        x=trial_outcomes[:100],
        nbinsx=30,
        name="Sample 1",
        histnorm='probability density',
        marker_color='crimson',
        opacity=0.7,
        showlegend=False
    )

    x_axis, norm_pdf = overlay_normal_dist(trial_outcomes[:100], trace1a)
    trace1b = go.Scatter(
        x=x_axis,
        y=norm_pdf,
        marker = dict(color='blue'),
        fillcolor='rgba(0, 0, 255, 0.2)',
        name="Normal Distribution (fit to data)",
        showlegend=False
    )

    trace2a = go.Histogram(
        x=trial_outcomes[:1_000],
        nbinsx=30,
        name="Sample 2",
        histnorm='probability density',
        marker_color='crimson',
        opacity=0.7,
        showlegend=False
    )

    x_axis, norm_pdf = overlay_normal_dist(trial_outcomes[:1_000], trace2a)
    trace2b = go.Scatter(
        x=x_axis,
        y=norm_pdf,
        marker = dict(color='blue'),
        fillcolor='rgba(0, 0, 255, 0.2)',
        name="Normal Distribution \n (fit to data)",
        showlegend=False
    )

    trace3a = go.Histogram(
        x=trial_outcomes[:10_000],
        nbinsx=30,
        name="Sample 3",
        histnorm='probability density',
        marker_color='crimson',
        opacity=0.7,
        showlegend=False
    )

    x_axis, norm_pdf = overlay_normal_dist(trial_outcomes[:10_000], trace3a)
    trace3b = go.Scatter(
        x=x_axis,
        y=norm_pdf,
        marker = dict(color='blue'),
        fillcolor='rgba(0, 0, 255, 0.2)',
        name="Normal Distribution (fit to data)",
        showlegend=False
    )

    trace4a = go.Histogram(
        x=trial_outcomes[:100_000],
        nbinsx=30,
        name="Sample 4",
        histnorm='probability density',
        marker_color='crimson',
        opacity=0.7,
        showlegend=False
    )

    x_axis, norm_pdf = overlay_normal_dist(trial_outcomes[:100_000], trace4a)
    trace4b = go.Scatter(
        x=x_axis,
        y=norm_pdf,
        marker = dict(color='blue'),
        fillcolor='rgba(0, 0, 255, 0.2)',
        name="Normal Distribution (fit to data)",
        showlegend=False
    )

    fig = plotly.subplots.make_subplots(
        rows=1,
        cols=4,
        print_grid=False,
        subplot_titles=("100 Trials", "1,000 Trials", "10,000 Trials", "100,000 Trials")
    )

    fig.append_trace(trace1a, 1, 1)
    fig.append_trace(trace1b, 1, 1)
    fig.append_trace(trace2a, 1, 2)
    fig.append_trace(trace2b, 1, 2)
    fig.append_trace(trace3a, 1, 3)
    fig.append_trace(trace3b, 1, 3)
    fig.append_trace(trace4a, 1, 4)
    fig.append_trace(trace4b, 1, 4)

    fig.update_traces(marker=dict(line=dict(width=1, color='black')))
    fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

    layout = go.Layout(
        barmode='overlay',
        width=950,
        height=300,
        xaxis1=dict(title="X"),
        yaxis=dict(title="Probabilty Density"),
        xaxis2=dict(title='X'),
        xaxis3=dict(title="X"),
        xaxis4=dict(title="X"),
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )

    fig.update_layout(layout)

    return plotly.offline.iplot(fig)
    # html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
    # display(HTML(html_fig))
In [98]:
data_dict = {
    'trial_outcomes': trial_outcomes,
    'trials_index': trials_index,
    'cumulative_means': cumulative_means,
}
experiment_sample_size(data_dict, sample_size=None)

We can see above that as the number of trials increases our histogram looks more and more like the normal distribution (in other words it is approaching the normal distribution). We can see that as the number of trials increase, we also converge towards the average value (if I extended the computation to 1,000,000 trials the convergence would be even more pronounced).

Why does law of large numbers and CLT hold here?

  • We are dealing with binary outcomes (he made it or he didn't, Black swan page: 244).
  • A lot of "cancelling" (BS page: 231, 234, 245, 246)
  • In situations like this, one observation cannot effect the total!
  • The reason one observation cannot effect the total is because each basket has a value/weight of 1! This means that it can only add or subtract 1 to the total. If this wasn't the case then the CLT may not hold. However, note that in arrow example below law of large numbers will hold, however, the mean is infinity and it will approach that!

TODO: Note that we have been using a sample size of 50 trials for the CLT. Potentially may want to run an experiment showing how the distribution changes when we have a fixed number of trials (say 1,000), but a varying sample size (say, 10, 25, 50, and 100). That may be helpful to demonstrate the CLT here.

Now, what happens if our trial size (in reality a sample size) decreases from 50 to 10? Let's take a look:

In [105]:
trial_outcomes, trials_index, cumulative_means = run_experiment(num_trials=100_000, shots_per_trial=10)
In [106]:
trace1 = go.Scatter(
    x=trials_index,
    y=cumulative_means,
    marker = dict(color='crimson'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Sample Cumulative Average"
)

trace2 = go.Scatter(
    x=trials_index,
    y=np.ones_like(trials_index)*7.23,
    marker = dict(color='blue'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Theoretical Expected Value"
)

data = [trace1, trace2]

layout = go.Layout(
    width=800,
    height=400,
    title="Cumulative average # shots made in trials vs. number of trials (Trial size = 10)",
    xaxis=dict(title="Trials"),
    yaxis=dict(title="Cumulative average # shots made"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

fig.update_layout(layout)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
In [107]:
data_dict = {
    'trial_outcomes': trial_outcomes,
    'trials_index': trials_index,
    'cumulative_means': cumulative_means,
}
experiment_sample_size(data_dict, sample_size=None)

We can then even run an experiment based on sample size: We just looked at a sample size (number of shots per trial) of 10. Let's try 20 now:

In [108]:
experiment_sample_size(sample_size=20)

And now we can look at a sample size of 30:

In [109]:
experiment_sample_size(sample_size=30)

Clearly, as the number of shots taken per trial increases (i.e. our sample size increases), we start to approach the normal distribution more closely.

TODO: Explain why if it is sample size or total number of trials that produces the normal dist.

  • The more samples we take, the more the graphed results will resemble the normal distribution
  • the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (also known as a “bell curve”), as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population distribution shape.
  • So, in reality it is the sample size (number of shots taken in a single trial) that is the determining factor. Plotting more trials just allows for us to have enough data for the normal distribution to appear

Example where it will not apply

Arrow shot at wall.

Why does it not apply?

TODO

When thinking about the arrow at the wall problem, remember the same thing that we ran into earlier-> If we want to eventually show 1,000,000 trials, we only need to calculate 1 million in total, and then we can iteratively step through and grab an additional trial at each iteration, using that in the total.

May want to run a few different 1,000,000 size trials and plot their distribution?

TODO: Diagrams for arrow problem.

In [2]:
def arrow_trials(arrows=50):
    """Returns mean distance of arrow shot over the course of `arrows` shots."""
    
    angles = np.random.uniform(size=arrows) * (np.pi / 2)
    distances = np.tan(angles)
    mean_distance = distances.mean()
    max_distance = distances.max()

    return distances, mean_distance, max_distance


def run_trials(num_trials=10_000, shots_per_trial=50):
    """Performs `num_trials` of arrow_trials() and returns an array of the outcomes."""
    
    trial_outcomes = []
    
    for i in range(num_trials):
        baskets_made_in_single_trial = single_trial(shots_per_trial=shots_per_trial)
        trial_outcomes.append(baskets_made_in_single_trial)
        
    return np.asarray(trial_outcomes)


def calculate_cumulative_means(trial_outcomes):
    """Calculates cumulative means based on list of trial outcomes."""
    
    cumulative_means = np.cumsum(trial_outcomes) / np.arange(1, trial_outcomes.shape[0] + 1, 1)
    return cumulative_means


def run_experiment(num_trials=10_000, shots_per_trial=50):
    """Orchestrates experiment by running trials and calculating cumulative means."""
    
    trial_outcomes = run_trials(num_trials=num_trials, shots_per_trial=shots_per_trial)
    trials_index = np.arange(0, trial_outcomes.shape[0], 1)
    
    cumulative_means = calculate_cumulative_means(trial_outcomes)
    
    return trial_outcomes, trials_index, cumulative_means   
In [3]:
distances, _, _ = arrow_trials(arrows=100_000)
cumulative_mean_distances = calculate_cumulative_means(distances)
x_axis = np.arange(0, distances.shape[0], 1)
In [ ]:
trace1 = go.Scatter(
    x=x_axis,
    y=cumulative_mean_distances,
    marker = dict(color='crimson'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Sample Cumulative Average"
)

data = [trace1]

layout = go.Layout(
    width=800,
    height=400,
    title="Cumulative average arrow distance vs. number of arrows shot",
    xaxis=dict(title="Number of arrows shot"),
    yaxis=dict(title="Cumulative average arrow distance"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

fig.update_layout(layout)

# plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
In [ ]:
distances, _, _ = arrow_trials(arrows=1_000_000)
cumulative_mean_distances = calculate_cumulative_means(distances)
x_axis = np.arange(0, distances.shape[0], 1)

trace1 = go.Scatter(
    x=x_axis[::10],
    y=cumulative_mean_distances[::10],
    marker = dict(color='crimson'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Sample Cumulative Average"
)

data = [trace1]

layout = go.Layout(
    width=800,
    height=400,
    title="Cumulative average arrow distance vs. number of arrows shot",
    xaxis=dict(title="Number of arrows shot"),
    yaxis=dict(title="Cumulative average arrow distance"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

fig.update_layout(layout)

# plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
In [6]:
%%time
distances, _, _ = arrow_trials(arrows=10_000_000)
cumulative_mean_distances = calculate_cumulative_means(distances)
x_axis = np.arange(0, distances.shape[0], 1)
CPU times: user 359 ms, sys: 76.2 ms, total: 435 ms
Wall time: 457 ms
In [ ]:
trace1 = go.Scatter(
    x=x_axis[::100],
    y=cumulative_mean_distances[::100],
    marker = dict(color='crimson'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Sample Cumulative Average"
)

data = [trace1]

layout = go.Layout(
    width=800,
    height=400,
    title="Cumulative average arrow distance vs. number of arrows shot",
    xaxis=dict(title="Number of arrows shot"),
    yaxis=dict(title="Cumulative average arrow distance"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

fig.update_layout(layout)

# plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
In [8]:
%%time
distances, _, _ = arrow_trials(arrows=100_000_000)
cumulative_mean_distances = calculate_cumulative_means(distances)
x_axis = np.arange(0, distances.shape[0], 1)
CPU times: user 4.35 s, sys: 2.52 s, total: 6.87 s
Wall time: 9.17 s
In [ ]:
trace1 = go.Scatter(
    x=x_axis[::1000],
    y=cumulative_mean_distances[::1000],
    marker = dict(color='crimson'),
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Sample Cumulative Average"
)

data = [trace1]

layout = go.Layout(
    width=800,
    height=400,
    title="Cumulative average arrow distance vs. number of arrows shot",
    xaxis=dict(title="Number of arrows shot"),
    yaxis=dict(title="Cumulative average arrow distance"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

fig.update_layout(layout)

# plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
In [10]:
data = []

for i in range(10):
    distances, _, _ = arrow_trials(arrows=10_000_000)
    cumulative_mean_distances = calculate_cumulative_means(distances)
    x_axis = np.arange(0, distances.shape[0], 1)
    
    trace = go.Scatter(
        x=x_axis[::100],
        y=cumulative_mean_distances[::100],
        fillcolor='rgba(0, 0, 255, 0.2)',
        name="Sample Cumulative Average"
    )
    
    data.append(trace)

layout = go.Layout(
    width=800,
    height=400,
    title="Cumulative average arrow distance vs. number of arrows shot",
    xaxis=dict(title="Number of arrows shot"),
    yaxis=dict(title="Cumulative average arrow distance"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

fig.update_layout(layout)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
    
In [ ]:
# TODO: Utilize the same type of plot in earlier example to show that all paths converge! 
In [165]:
# TODO:
# consider plotting max values to show how widely they vary (compared to simple yes/no outcome experiments)
# Greater than 1 million it may make sense not to do cumulative mean...
In [178]:
distances, mean_dist, max_dist = arrow_trials(arrows=100_000_000)
In [179]:
mean_dist
Out[179]:
10.85443378257162
In [180]:
max_dist
Out[180]:
41177706.0465903

What is the KS Score

  • Touch on data distributions (we can just use normal for the first example, should also use non normal)
    • For the example we know the data generating process, in general (obviously), we DO NOT!
    • In this case the data generating process is created a normally distributed sample of of size 200, with mean 0 and mean 1, both have variance 1
  • Display their CDFs
  • Walk through the technical implementation details of finding the difference in CDFs
  • TODO: P value/probability calculation
In [71]:
sample_size = 200

mean1 = 0
mean2 = 1
var1 = 1
var2 = 1

samp1 = np.random.normal(mean1, var1, sample_size)
samp2 = np.random.normal(mean2, var2, sample_size)
In [111]:
trace1a = go.Histogram(
    x=samp1,
    nbinsx=20,
    name="Sample 1",
    marker_color='blue',
    opacity=0.7
)

trace2a = go.Histogram(
    x=samp2,
    nbinsx=20,
    name="Sample 2",
    marker_color='crimson',
    opacity=0.7
)

trace1b = go.Histogram(
    x=samp1,
    nbinsx=20,
    name="Sample 1",
    marker_color='blue',
    opacity=0.7,
    cumulative_enabled=True,
    showlegend=False
)

trace2b = go.Histogram(
    x=samp2,
    nbinsx=20,
    name="Sample 2",
    marker_color='crimson',
    opacity=0.7,
    cumulative_enabled=True,
    showlegend=False
)

fig = plotly.subplots.make_subplots(
    rows=1,
    cols=2,
    print_grid=False,
    subplot_titles=('Sample 1 & 2 Histograms', 'Sample 1 & 2 Cumulative Histograms')
)

fig.append_trace(trace1a, 1, 1)
fig.append_trace(trace2a, 1, 1)
fig.append_trace(trace1b, 1, 2)
fig.append_trace(trace2b, 1, 2)

fig.update_traces(marker=dict(line=dict(width=1, color='black')))
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

layout = go.Layout(
    barmode='overlay',
    width=950,
    height=450,
    xaxis1=dict(title="X"),
    yaxis=dict(title="Count"),
    xaxis2=dict(title='X'),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig.update_layout(
    layout
)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
In [128]:
x_axis = np.arange(-3, 4, 0.01)

y_samp1_pdf = norm.pdf(x_axis, mean1, var1)
y_samp2_pdf = norm.pdf(x_axis, mean2, var2)
y_samp1_cdf = norm.cdf(x_axis, mean1, var1)
y_samp2_cdf = norm.cdf(x_axis, mean2, var2)

trace1a = go.Scatter(
    x=x_axis,
    y=y_samp1_pdf,
    marker = dict(color='blue'),
    fill='tozeroy',
    fillcolor='rgba(0, 0, 255, 0.2)',
    name="Sample 1"
)

trace2a = go.Scatter(
    x=x_axis,
    y=y_samp2_pdf,
    marker = dict(color='red'),
    fill='tozeroy',
    fillcolor='rgba(255, 0, 0, 0.2)',
    name="Sample 2"
)

trace1b = go.Scatter(
    x=x_axis,
    y=y_samp1_cdf,
    marker = dict(color='blue'),
    name="Sample 1",
    showlegend=False

)

trace2b = go.Scatter(
    x=x_axis,
    y=y_samp2_cdf,
    marker = dict(color='red'),
    name="Sample 2",
    showlegend=False
)

fig = plotly.subplots.make_subplots(
    rows=1,
    cols=2,
    print_grid=False,
    subplot_titles=('Sample 1 & 2 PDFs', 'Sample 1 & 2 CDFs')
)

fig.append_trace(trace1a, 1, 1)
fig.append_trace(trace2a, 1, 1)
fig.append_trace(trace1b, 1, 2)
fig.append_trace(trace2b, 1, 2)

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

layout = go.Layout(
    width=950,
    height=450,
    xaxis1=dict(title="X"),
    yaxis1=dict(title="Probability Density"),
    xaxis2=dict(title='X'),
    yaxis2=dict(title="Cumulative Probability"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig.update_layout(
    layout
)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
In [117]:
def empirical_cdf(x):
    x.sort()
    return x, np.arange(0, x.shape[0], 1) / (x.shape[0] - 1)
In [127]:
x_axis_samp1, emp_cdf_samp1 = empirical_cdf(samp1)
x_axis_samp2, emp_cdf_samp2 = empirical_cdf(samp2)

trace1 = go.Scatter(
    x=x_axis_samp1,
    y=emp_cdf_samp1,
    marker = dict(color='blue'),
    name="Sample 1"
)

trace2 = go.Scatter(
    x=x_axis_samp2,
    y=emp_cdf_samp2,
    marker = dict(color='red'),
    name="Sample 2"
)

data = [trace1, trace2]

layout = go.Layout(
    width=650,
    height=400,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    title="Empirical CDF",
    xaxis=dict(title="X"),
    yaxis=dict(title="Cumulative Probability")
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))

TODO: Explain how empirical_cdf works: https://en.wikipedia.org/wiki/Empirical_distribution_function

The problem that we run into is that our x values are not the same for each distribution! This actually comes back to the way in which distributions differ compared to standard continuous functions. A continuous distribution function in reality is an abstraction (compared to a histogram or discrete distribution) that would not occur in the real world. It is meant to capture the probability of certain x values of occuring; this probability is a function of the x values; some are more probable than others. The key here to keep in mind is that when we gather data we do not observe a nice continuous spectrum of x values. We only get a certain discrete number. Hence, in our plot above, there are many (if not all) x values that occur in sample 1 but not sample 2.

This is actually a very interesting thing to keep in mind with functions. We generally consider them as taking a continuous input, x, (i.e. uniformly distributed), and determining a response for that value. Distribution functions are different in the sense that their input, by definition, does not need to be uniformly distributed! It can take on any sort of distribution/spread. The response is (y axis) is meant to capture how probable (based on the entire input) a given input value is. In other words, for a distribution function to work successfully it must see all of the data ahead of time.

We can then abstract from these messy distributions to more theoretical ones, such as the normal. Touch on taleb here.

Now, when it comes to our implementation, we need to figure out a way to computationally compare the two above empirical CDFS. Let's look at how we may do that.

KS Score implementation

In [157]:
ks_2samp(samp1, samp2)
Out[157]:
Ks_2sampResult(statistic=0.455, pvalue=4.908580954287128e-19)

Above is the actual KS score. How may we get there? At first it may seem like we could simply subtract the two curves above, let's give that shot!

In [158]:
diff = emp_cdf_samp1 - emp_cdf_samp2
In [160]:
diff
Out[160]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

It looks like according to that calculation our curves are identical-but based on visual inspection (and the fact that we know the generating functions) this is incorrect. Where did we go wrong? Well all we did was take the differences of the two curves y values. However, we never checked to see if the x values actually matched up (remember, we want to take the difference between the cdfs at the same point in x).

Let's see where we went wrong by looking at the index 50:

In [171]:
idx = 50
emp_cdf_samp1[idx]
Out[171]:
0.25125628140703515
In [166]:
emp_cdf_samp2[idx]
Out[166]:
0.25125628140703515

We can see that at an index of 50 our empirical cdf's contain the same value. The problem is, while we generally treat our index to be similar to our x value, in this case they do not match up! Let's take a look:

In [167]:
x_axis_samp1[idx]
Out[167]:
-0.7616459973200506
In [168]:
x_axis_samp2[idx]
Out[168]:
0.372598825869213

So in other words we just subtracted the empirical cdf of sample 1 at x = -7.616 from the empirical cdf of sample 2 at x = 0.372. Visually, what we did is shown below:

In [178]:
x_axis_samp1, emp_cdf_samp1 = empirical_cdf(samp1)
x_axis_samp2, emp_cdf_samp2 = empirical_cdf(samp2)

trace1 = go.Scatter(
    x=x_axis_samp1,
    y=emp_cdf_samp1,
    marker = dict(color='blue'),
    name="Sample 1"
)

trace2 = go.Scatter(
    x=x_axis_samp2,
    y=emp_cdf_samp2,
    marker = dict(color='red'),
    name="Sample 1"
)

trace3 = go.Scatter(
    x=[x_axis_samp1[idx]],
    y=[emp_cdf_samp1[idx]],
    marker = dict(color = 'blue',size=10),
    name="Sample 1, index=50",
    mode='markers'
)

trace4 = go.Scatter(
    x=[x_axis_samp2[idx]],
    y=[emp_cdf_samp2[idx]],
    marker = dict(color = 'red',size=10),
    name="Sample 2, index=50",
    mode='markers'
)

data = [trace1, trace2, trace3, trace4]

layout = go.Layout(
    width=650,
    height=400,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    title="Empirical CDF",
    xaxis=dict(title="X"),
    yaxis=dict(title="Cumulative Probability")
)

fig = go.Figure(data=data, layout=layout)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='#f1f1f1')

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))

So, we clearly can see above that we simply subtracted the y values at the incorrect index. We chose a common index (50) for each, but that clearly did not match up correctly. In reality we want to chose a common x value and subtract the empirical cdfs at that point.

What is a better way that we could go about this? Well, a place to start is to realize that in reality we want to calculate:

$$\hat{F}_n(t) = \frac{\text{number of elements in the sample } \leq t}{n} = \frac{1}{n} \sum_{i=1}^n \textbf{1}_{X_i \leq t}$$

Where $\textbf{1}$ is an indicator function, and our sample contains $X_1,\dots,X_n$. When I had plotted above, I have calculated $\hat{F}$ for $x \in X$.

That defines the empirical distribution. It also allows us to the see why the actual y values for the empirical CDFs above were identical; because the cumulative probability is only a function of the number of data points in the sample less than or equal to the current data point. In other words, it is a step function that jumps by $\frac{1}{n}$ at each of the $n$ data points. Now, in this empirical distribution, intuition tells us that when we have a large slow (large rise over a small run), there is a density of points in that range. Like wise a small slope corresponds to a low density of points in that range. Does this make sense? YES! Remember that the CDF is the integral of the PDF, and likewise the PDF is the derivative of the CDF! So, where the slope is steepest above we would expect a peak (mode of the distribution) and indeed this is what we see!

Now, based on this function, can we write a helper function to determine the empirical distribution value of a single input? Indeed we can!

In [179]:
def empirical_dist_helper(sample, t):
    n = sample.shape[0]
    return np.searchsorted(sample, t, side='right') / n
In [181]:
empirical_dist_helper(np.array([3,4,10,20]), 15)
Out[181]:
0.75

Now that we have that helper, which utilizes np.searchsorted, how can we make the next jump and use this to compare two empirical cdfs? Well, in reality what we want to do is determine, based on all of our data (i.e. the concatenation of our samples-this defines our entire sample space), what $\hat{F}$ evaluates two for each sample. An example will make this more clear.

Let's look at the 90th $x$ point in sample1:

In [183]:
samp1[90]
Out[183]:
-0.10819431828759411

Let's use empirical_dist_helper in order to help us find $\hat{F}$ for each sample:

In [190]:
x_90 = samp1[90]
empirical_dist_helper(samp1, x_90)
Out[190]:
0.455
In [192]:
empirical_dist_helper(samp2, x_90)
Out[192]:
0.115

What we did was just calculated the empirical distribution value for a given input, x_90 = -0.108..., for both samp1 and samp2.

Now, our final objective is to calculate the empirical distribution value for all possible values of our sample space! In other words, we want to take all values in samp1 and samp2 and run them through empirical_dist_helper. Thankfully np.searchsorted can take a list comparators. A final implementation will look like:

In [195]:
# DOCS: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L4029

def ks_score_implementation(x1, x2):
    # Sort and combine all data into total input space
    data1 = np.sort(x1)
    data2 = np.sort(x2)
    data_all = np.concatenate([data1, data2])
    
    # Determine 'cdf' of each sample, based on entire input space. Note, this is not a traditional CDF.
    # cdf1_data and cdf2_data simply are arrays that hold the value of F_hat based on identical inputs 
    # from data_all
    cdf1_data = empirical_dist_helper(data1, data_all)
    cdf2_data = empirical_dist_helper(data2, data_all)
    
    cdf_diffs = cdf1_data - cdf2_data
    minS = -np.min(cdf_diffs)
    maxS = np.max(cdf_diffs)
    
    d = max(minS, maxS)
    
    return d
In [197]:
d = ks_score_implementation(samp1, samp2)
In [203]:
d
Out[203]:
0.45500000000000007

And below I expanded the running of empirical_dist_helper across all points in our input space, data_all, but utilizing a for loop:

In [201]:
def ks_score_implementation_expanded(x1, x2):
    # Sort and combine all data into total input space
    data1 = np.sort(x1)
    data2 = np.sort(x2)
    data_all = np.concatenate([data1, data2])
    
    # Expanding CDF implementation 
    cdf1_data = []
    cdf2_data = []
    for point in data_all: 
        cdf1_data.append(empirical_dist_helper(data1, point))
        cdf2_data.append(empirical_dist_helper(data2, point))
    
    cdf_diffs = np.asarray(cdf1_data) - np.asarray(cdf2_data)
    minS = -np.min(cdf_diffs)
    maxS = np.max(cdf_diffs)
    
    d = max(minS, maxS)
    
    return d
In [202]:
ks_score_implementation_expanded(samp1, samp2)
Out[202]:
0.45500000000000007

Remember, when doing the search sorted we are literally finding what the CDF value would be of each data point, across the entire data set. So, in essence we:

  • Take the entire data set (concatenated togetether)
  • For each data point in the entire data set, one by one, determine it's cumulative probability of occuring in sample 1 and sample 2. Store this value in a CDF sample 1 and CDF sample 2 array
  • After doing this for each point, we now can subtract arrays element wise to find our max diff

The key idea here is that these two CDF sample arrays are NOT traditional CDFs! They are simply lists that are meant to hold the values of cumulative probability of data points from our total data set. They are not CDFs in the traditional sense. Our goal here is not to plot-we are not creating a uniformly distributed x axis and passing that in to some function (via. np.arange, np.linspace, etc).


© 2018 Nathaniel Dake